1 Structure

1.1 Column types

# 
# numeric 
#      29

1.2 Missing values, NAs, and Negatives

# [1] "Are all rows complete?: TRUE"
# [1] "Are there any NAs?: FALSE"
# [1] "Are any values negative?: FALSE"
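
The three checks reported above can be sketched as follows. This is a minimal standard-library Python illustration, not the code that produced the R output; the function name `summarize_checks` and the list-of-rows input format are assumptions made for the example.

```python
import math


def summarize_checks(rows):
    """Report completeness, NA presence, and negative values for numeric rows.

    `rows` is a list of equal-length lists; None and NaN count as missing.
    (Hypothetical helper for illustration only.)
    """
    def is_missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))

    all_complete = all(not is_missing(v) for row in rows for v in row)
    any_negative = any(v < 0 for row in rows for v in row if not is_missing(v))
    return {
        "all_complete": all_complete,
        "any_na": not all_complete,
        "any_negative": any_negative,
    }


example = [[1.0, 2.5], [0.0, 3.0]]
report = summarize_checks(example)
```

For purely numeric data, "all rows complete" and "no NAs" are two views of the same condition, which is why the sketch derives one from the other.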

1.3 Compress data if dimensions are too big.

If the number of columns \(d\) exceeds 100, we reduce the columns using CUR decomposition; if the number of rows \(n\) exceeds \(10^3\), we reduce the rows the same way.
Other methods are available.
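
The thresholding logic can be sketched as below. Note that this is not a CUR decomposition: as a deliberately simplified stand-in, it keeps the columns (then rows) with the largest squared Euclidean norm, which is only loosely related to the norm-based selection used in some CUR variants. The function names and the `max_cols` / `max_rows` parameters are assumptions for illustration.

```python
def top_k_by_squared_norm(vectors, k):
    """Indices of the k vectors with the largest squared norm, in original order."""
    scores = [sum(v * v for v in vec) for vec in vectors]
    ranked = sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def compress(rows, max_cols=100, max_rows=1000):
    """Keep at most max_cols columns and max_rows rows.

    Stand-in for CUR: selection is by squared norm, not by a real
    CUR column/row sampling scheme.
    """
    n, d = len(rows), len(rows[0])
    if d > max_cols:
        columns = [list(col) for col in zip(*rows)]
        keep = top_k_by_squared_norm(columns, max_cols)
        rows = [[row[j] for j in keep] for row in rows]
    if n > max_rows:
        keep = top_k_by_squared_norm(rows, max_rows)
        rows = [rows[i] for i in keep]
    return rows


# Tiny thresholds so the effect is visible on a 3 x 3 example.
small = compress([[1, 10, 0], [2, 20, 0], [3, 30, 1]], max_cols=2, max_rows=2)
```

Like CUR, this keeps actual columns and rows of the data rather than linear combinations, so the reduced table stays interpretable in terms of the original features.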

2 Heatmap

The heatmap below is a representation of the data with values shown in color according to magnitude. Hover over the plot to see column names.

Heatmap

3 Violin Plot with jittered points

The violin plot combines a kernel density estimate with a boxplot for a more detailed visualization. A jittered scatter plot of the points is overlaid. The jittering helps reduce the effects of overplotting.

  1. Box plots
  2. Kernel Density Estimate
  3. Violin Plots
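
The two numerical ingredients of this plot, the kernel density estimate and the jitter, can be sketched in standard-library Python. The Gaussian kernel, the fixed bandwidth, and the uniform jitter width are assumptions chosen for the example; the report's plotting code may use different defaults.

```python
import math
import random


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density at x with a Gaussian kernel."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


def jitter(n, width=0.1, seed=0):
    """Small uniform horizontal offsets so overlapping points become visible."""
    rng = random.Random(seed)
    return [rng.uniform(-width, width) for _ in range(n)]


values = [1.0, 1.2, 1.1, 3.0, 3.1]
density = gaussian_kde(values, bandwidth=0.5)  # outline of the violin
offsets = jitter(len(values))                  # x-offsets for the overlaid points
```

The density function gives the width of the violin at each height, and the offsets are added to the category position of each point so that near-identical values do not hide one another.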

4 1D Heatmap

For each feature column, the data are binned and a heatmap is produced with each bin colored according to count.
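
The binning step can be sketched as follows, assuming equal-width bins spanning each column's own range. The bin count and the function names are assumptions for illustration, not the report's settings.

```python
def bin_counts(values, n_bins=10):
    """Count values in n_bins equal-width bins between min(values) and max(values)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: put everything in the first bin.
        return [len(values)] + [0] * (n_bins - 1)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value falls on the right edge; clamp it into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


def column_bin_counts(rows, n_bins=10):
    """One list of bin counts per feature column."""
    return [bin_counts(list(col), n_bins) for col in zip(*rows)]


data = [[0.0, 5.0], [1.0, 5.0], [2.0, 6.0], [4.0, 9.0]]
counts = column_bin_counts(data, n_bins=4)
```

Each inner list corresponds to one strip of the 1D heatmap, with the count in each bin mapped to a color.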

5 Correlation Plot

The correlation between two random variables measures a specific type of dependence: the degree to which a linear relationship exists between them. A value of 1 corresponds to a direct linear relationship, 0 corresponds to no linear relationship, and -1 corresponds to an inverse linear relationship.

  1. Correlation
  2. Correlation and dependence
  3. Example graphic
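
As a small worked illustration of the three reference values, here is a standard-library Python sketch of the sample Pearson correlation coefficient. The function name `pearson` and the example data are assumptions for illustration.

```python
import math


def pearson(x, y):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


x = [1.0, 2.0, 3.0, 4.0]
r_direct = pearson(x, [2 * v + 1 for v in x])   # increasing line
r_inverse = pearson(x, [-3 * v for v in x])     # decreasing line
# y = x^2 on a symmetric range: fully dependent, yet not linearly related.
r_none = pearson([-2.0, -1.0, 1.0, 2.0], [4.0, 1.0, 1.0, 4.0])
```

The last case shows why a correlation of 0 means only "no linear relationship": the second variable is completely determined by the first, yet the coefficient is zero.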

6 Outlier Plots

An outlier is a data point that lies relatively far away from the bulk of the other observations. Outliers can have unwanted effects on data analysis and therefore should be considered carefully.

We use the built-in method from the randomForest package in R.

  1. randomForest
  2. Outlier

7 Cumulative Variance

The variance measures how spread out the data are from their mean. Cumulative variance measures, as a percentage, how much of the dataset's total variation is accounted for by the dimensions included so far.

In this implementation we use principal components analysis to select linear combinations of the features that explain the dataset best in low dimensions.

The plot below shows how much variance is explained when adding columns one at a time. The elbows denote good “cut-off” points for dimension selection.

  1. Variance
  2. PCA
  3. Elbows
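
The cumulative percentage itself is simple arithmetic once the per-component variances are known. The sketch below assumes those variances have already been computed by a PCA routine and only shows the accumulation step; the function name and the example numbers are made up for illustration.

```python
def cumulative_variance_percent(component_variances):
    """Cumulative share of total variance, in percent, largest component first."""
    total = sum(component_variances)
    ordered = sorted(component_variances, reverse=True)
    cumulative, running = [], 0.0
    for variance in ordered:
        running += variance
        cumulative.append(100.0 * running / total)
    return cumulative


# Hypothetical component variances, not taken from the report's data.
cumulative = cumulative_variance_percent([4.0, 3.0, 2.0, 1.0])
```

With these made-up variances, the first component accounts for 40% of the total, the first two for 70%, the first three for 90%, and all four for 100%. An elbow is a point after which further components add comparatively little.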

8 Cumulative sum (Correlation Matrix)

9 Pairs Plots

A pairs plot is a popular way of plotting high-dimensional data.
Every pair of dimensions is plotted, showing the projection of the data onto those two dimensions.

For readability, at most 8 dimensions are plotted.
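
The set of panels in such a plot can be enumerated as below. The sketch assumes the first 8 columns are the ones kept, which the report does not specify; the function name and the column names `V1`, `V2`, ... are also placeholders.

```python
from itertools import combinations


def pairs_to_plot(column_names, max_dims=8):
    """All unordered pairs among the first max_dims columns.

    Assumption: the cap keeps the leading columns; the report does not say
    which 8 dimensions are chosen.
    """
    kept = column_names[:max_dims]
    return list(combinations(kept, 2))


names = [f"V{i}" for i in range(1, 30)]  # 29 placeholder column names
panels = pairs_to_plot(names)
```

With 8 dimensions there are 8 · 7 / 2 = 28 distinct pairs, whereas all 29 columns would give 406, which is why a cap keeps the plot readable.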

10 BIC Plots

11 Mclust classifications

12 Binary Hierarchical Mclust classifications

13 3D PCA of correlation matrix

14 Jittered Scatter Plot with classifications